Abstract

Frequent itemsets mining is a fundamental primitive in data mining: it requires identifying all itemsets that appear
in at least a fraction $\theta$ of the transactions of a dataset. However, a transactional dataset is only a sample from
the underlying process that generates the data, and understanding this process is the ultimate goal of data mining. In
general, the generative process yields transactions according to a probability distribution. Therefore, the output of
traditional frequent itemsets mining algorithms can, and typically does, contain a large number of spurious patterns whose
support exceeds the minimum threshold $\theta$ in the dataset at hand only because of the stochasticity
inherent in the dataset generation process. For the end user to make informed decisions using the mining
results, the returned collection must contain only Real Frequent Itemsets (RFI's), i.e., itemsets $A$
whose real frequency (the probability that $A$ appears in a transaction) is greater than or equal to the minimum
threshold $\theta$. In this work we present an algorithm to extract a collection $C$ of RFI's while keeping the probability
that one or more spurious itemsets are included in $C$ within a user-specified limit. In other words, we present a
statistical test for RFI's for which we can guarantee that the Family-Wise Error Rate is within the user-specified limit.
We use results from statistical learning theory involving the Vapnik-Chervonenkis (VC) dimension of the problem at
hand to accomplish this goal. This allows us to achieve, on the same data, much stricter bounds on the probability of
a Type I error than traditional multiple hypothesis testing corrections allow. In our experimental
evaluation we show empirically that our test has very high statistical power, i.e., the output collection contains a large
fraction of the RFI's.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining 
Keywords: Frequent itemsets, Statistical test, VC-dimension, Type-1 error, Multiple hypothesis testing, Family-wise
error rate.

1 Introduction 

The mining of association rules is one of the fundamental primitives in the process of knowledge discovery from large
databases. In its most general definition, the problem can be reduced to identifying sets of items, or itemsets, that
appear in at least a fraction $\theta$ of all transactions, where $\theta$ is provided as input by the user.

One of the main issues in frequent itemsets mining is the presence of spurious discoveries, or false positives, in
the output. These are itemsets that are reported even though their appearance is due only to random associations
in the dataset, and are therefore not significant. The presence of false positives undermines the success of subsequent
analyses based on frequent itemsets.

A number of approaches have been recently proposed to identify significant itemsets, whose appearance in the 
dataset is not due to random associations. These approaches are based on statistical tests that assess the significance of 
the itemsets by defining a random model that captures the properties characterizing the association between the items 
in an itemset.

For example, consider two items $a$ and $b$, each appearing in 5% of the transactions of a dataset, and assume that we
are interested in significant itemsets that appear in at least 2% of the transactions. To avoid reporting spurious
discoveries, one can test whether the frequency of the itemset $\{a, b\}$ can be explained entirely by the frequencies of $a$ and $b$,
by comparing the frequency of $\{a, b\}$ with the distribution of the frequency under the hypothesis that $a$ and $b$ are placed
independently into transactions. For example, if $\{a, b\}$ appears in 3% of the transactions, the two items are highly associated,
since the probability that $\{a, b\}$ appears in at least 3% of the transactions if $a$ and $b$ are independent is only $6 \times 10^{-4}$.
This is an example of a statistical test, in which the probability, or p-value, that a measure, or statistic (in the example,
the frequency of $\{a, b\}$), is at least as extreme as the value observed in real data is computed under a null hypothesis
that captures the properties of spurious discoveries (in the example, the independence of items). When the p-value
is small enough, the itemset is flagged as significant; otherwise the itemset is discarded as a spurious discovery. A
number of different procedures [6, 14, 17, 21, 28, 35, 38] have been proposed in recent years to control the number of
spurious discoveries that are reported after the frequent itemsets are identified. These procedures take into account
the fact that a transactional dataset contains a number of patterns, and that the assessment of their significance is therefore
a multiple hypothesis testing problem.
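
The independence test in this example can be sketched in a few lines of Python. The sketch is only illustrative: the dataset size `n` is a hypothetical value (the example does not state one, so the resulting p-value is not meant to reproduce the figure quoted in the text), and `binomial_tail` is a helper name introduced here. It assumes that, under the null hypothesis, each transaction contains both items independently with probability 0.05 x 0.05.

```python
from math import comb


def binomial_tail(n: int, k: int, q: float) -> float:
    """Return Pr[Bin(n, q) >= k], summing the probability mass function."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


n = 500                 # hypothetical number of transactions
q_null = 0.05 * 0.05    # frequency of {a, b} if a and b were independent
observed = 15           # {a, b} observed in 3% of the n transactions

# p-value: probability of a support at least as large as the observed one
# when the two items are placed independently into transactions.
p_value = binomial_tail(n, observed, q_null)
print(f"p-value under independence: {p_value:.3e}")
```

A small p-value only says that the co-occurrence is unlikely under independence; it says nothing about whether the itemset reaches the frequency threshold in the generating distribution, which is the question addressed in the rest of the paper.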

However, there is another source of false discoveries among frequent patterns that has received scant attention
in the literature: the observed transactional dataset is only a finite sample from all possible transactions that may be
generated by the process resulting in the transactional dataset. In its most general form, this process can be thought of
as a sampling process from an unknown distribution $p$ over all possible transactions, and in reality one is not interested
in the itemsets that are frequent in the observed dataset, but in the itemsets that are frequent according to $p$, in the
sense that the probability that they appear in a transaction sampled from $p$ is at least $\theta$.

For example, consider the transactions given by items that are bought together on Amazon; after observing a certain
number of transactions, one is interested in inferring associations between items that are valid for the distribution over
all possible purchases, not only for the current, observed set $\mathcal{D}$ of purchases, which is only a partial observation
obtained from a distribution that also includes purchases that have not been recorded in the dataset, or that will
materialize only in the future.

The itemsets that are not frequent in $p$ but are frequent in $\mathcal{D}$ are false discoveries, and are not going to be filtered by
the statistical tests described above, since such tests assume that the itemsets that are frequent in $\mathcal{D}$, whose significance
they assess, represent the frequent itemsets in $p$ as well. In fact, these tests define itemsets as spurious by considering
properties of the itemsets other than their frequency. In the example above, the co-occurrence of $\{a, b\}$ is likely not due
to random chance; however, the probability that $\{a, b\}$ appears with frequency 3% in $\mathcal{D}$ while its frequency in $p$ is 1%
is 0.08, therefore if $\theta = 2\%$ then $\{a, b\}$ is likely a false discovery. As noted by Liu et al. [32], the assessment
of the significance of the frequent itemsets cannot replace the role played by the minimum support threshold $\theta$, which is
to reflect the level of domain significance and is to be used in concert with statistical significance to filter uninteresting
patterns that arise from different sources. Therefore, being able to rigorously identify frequent patterns in $p$ is crucial
to obtaining high quality patterns.

In this paper we address the problem of identifying itemsets that appear with probability at least $\theta$ in $p$, which we
call Real Frequent Itemsets (RFI's), while providing rigorous probabilistic guarantees on the number of false discoveries,
without making any assumption on the particular generative model of the transactions. This makes our method
completely distribution-free. In particular, we focus on returning a set of RFI's with bounded Family-Wise Error Rate
(FWER), that is, the probability that one or more false discoveries are reported among the RFI's. A recently proposed
alternative to bounding the FWER is to bound the False Discovery Rate (FDR), which in our case corresponds to the proportion
of false discoveries among the RFI's. Using the FDR allows a larger number of patterns to be reported, since a
small proportion of false discoveries is tolerated in the output; however, in data mining the number of patterns produced
is usually high, so having a smaller number of high quality discoveries is preferable to reporting a larger number
of patterns containing some false discoveries.

We stress that we do not assume any generative model on the transactions that are observed in the transactional
dataset. In fact, we only assume that the transactions in the dataset are independent samples from the distribution $p$,
without any constraint on the properties of $p$. This is in contrast with the assumptions made by the methods that
assess the significance after the frequent patterns have been identified. In fact, these methods require a well specified,
limited model to characterize the significance of a pattern, as is the case for independence in the example above.

1.1 Our contributions

We introduce a rigorous statistical test to identify real frequent itemsets that guarantees that the Family-Wise Error Rate
is within the user-specified limits. The test is based on mathematical tools from statistical learning theory. We define
a range set associated to the problem at hand and give an upper bound to its (empirical) VC-dimension, showing an
interesting connection with a variant of the knapsack optimization problem. To the best of our knowledge, ours is the
first work to apply these techniques to the field of RFI's, and in general the first application of the sample complexity
bound based on empirical VC-dimension to the field of data mining. We implemented our test and evaluated its
performance. First, we assessed the values and the behaviour of key parameters of our method, finding them
very reasonable. Second, we checked how well the test controls the FWER: we noticed that it performs in practice
even better than what the theory guarantees. Third, we empirically evaluated the statistical power of our method,
noticing that only a small fraction of the RFI's is not included in the output collection, i.e., that the test has a high
statistical power. We compared it with other available tests to extract RFI's and found the power of our method
comparable to, or even better than, the current state of the art. Lastly, we tested whether the frequency in the dataset is a
good estimator for the real frequency of an RFI, answering this question positively.

Outline. The article is organized as follows. In Section 2 we review relevant previous contributions. Section 3
contains preliminaries to formally define the problem and key concepts that we will use throughout the work. Our
statistical test is described and analyzed in Section 4. We present the methodology and results of our experimental
evaluation of the test in Section 5. Conclusions and future work are presented in Section 6.

2 Previous work 

Given a sufficiently low minimum frequency threshold, traditional itemsets mining algorithms can return a collection
of frequent patterns so large as to become almost uninformative to the human user. The quest for reducing the number
of patterns given in output has been developing along two different directions suggesting non-mutually-exclusive
approaches. One of these lines of research starts from the observation that the information contained in a set
of patterns can be compressed, with or without loss, to a much smaller collection. This led to the definition of concepts
like closed, maximal, and non-derivable itemsets. This is not the approach taken in this work and we refer the interested
reader to the survey by Calders et al. [9].

The intuition at the basis of the second approach to reduce the number of patterns given in output by traditional
itemsets mining algorithms consists in observing that a large portion of the patterns may be spurious, i.e., not actually
interesting but only a consequence of the fact that the dataset is just a sample from the underlying process that generates
the data, the understanding of which is the ultimate goal of data mining. This observation led to a proliferation of
interestingness measures. In this work we are interested in a very specific definition of interestingness that is based on
statistical properties of the patterns. We refer the reader to [22, Sect. 3] and [15] for surveys on different measures.

A number of works explored the idea of using statistical properties of the patterns in order to assess their
interestingness. Most of these works are focused on association rules, but some results can be applied to itemsets. In these
works, the notion of interestingness is related to the deviation between the actual support of a pattern in the dataset and
its expected support in a random dataset generated according to a statistical model that can incorporate prior belief and
that can be updated during the mining process to ensure that the most "surprising" patterns are extracted. In many
previous works, the statistical model was a simple independence model: an item belongs to a transaction independently
from other items [6, 14, 17, 21, 28, 35, 38]. Other works used Bayesian networks to express the prior belief [27]. In
contrast, our work does not assume any statistical model for data generation, or better, does not impose any restriction
on the model, with the result that our method is as general as possible. Moreover, the interestingness of a pattern
is determined by its probability according to the distribution that regulates the generation of transactions (on which,
again, we do not impose any limitation) and by whether this probability reaches a user-specified minimum threshold.

Kirsch et al. [28] developed a multi-hypothesis testing procedure to identify the best support threshold such that
the number of itemsets with at least such support deviates significantly from its expectation in a random dataset of the
same size and with the same frequency distribution for the individual items. In our work, the minimum threshold is
an input parameter fixed by the user, and we return a collection of itemsets such that they all have a support at least as
high as the threshold with respect to the distribution that generates the sample data.

Bolton et al. [6] suggest that, in pattern extraction settings, it may be more relevant to use methods to bound the
False Discovery Rate rather than the Family-Wise Error Rate, due to the high number of statistical tests involved. In
our experimental evaluation we noticed that the high number of tests is a problem when using traditional
multiple-hypothesis correction techniques. The method we present does not incur this issue because it considers all the
itemsets together, without the need to test each of them individually.

Lallich et al. [30] introduce the User Adjusted Family-Wise Error Rate (UAFWER), a flexible variant of FWER which
allows a user-specified number of false discoveries, rather than no false discovery as in the standard FWER definition.
They present a bootstrap-based method to evaluate the quality of a collection of association rules by comparing their
empirical interestingness with the interestingness in a sequence of random datasets obtained by resampling the original
data. We found that we can easily bound the FWER, so we have no need to use the UAFWER.

Gionis et al. [17] present a method to create random datasets that can act as samples from a distribution satisfying
an assumed generative model. The main idea is to swap items in a given dataset while keeping the length of the
transactions and the sum over the columns constant. This method is only applicable if one can actually derive a
procedure to perform the swapping in such a way that the generated datasets are indeed samples. This procedure is
problem dependent and there does not seem to be a way to derive one for the problem we are interested in.

Webb [42] proposes the use of established statistical techniques to control the risk of Type-1 errors. One method is
based on the Bonferroni and Holm correction, where the significance level is decreased proportionally to the number
of tested hypotheses and each of them is tested separately. In the second method (called holdout), the available data
are split into two parts: one is used for pattern discovery, while the second is used to verify the significance of the
discovered patterns, testing one statistical hypothesis at a time. A new method (layered critical values) to choose the
critical values when using a direct adjustment technique to control the FWER is presented by Webb [43] and works by
exploiting the itemset lattice. Our method can be used when the data cannot be split, as can be the case for graphs or
spatial data. In our experimental evaluation we compared the statistical power of the test we propose with the power
of the holdout method, showing that neither is uniformly better than the other. We tried to apply the method based on
the Bonferroni/Holm correction, and the layered critical value approach, but the very high number of itemsets to take
into consideration makes these methods very inefficient in practice, to the point of hitting the precision limit of the
computing platform. We refer the interested reader to Section 5.5 for additional details.

Hanhijärvi [23] presents a direct adjustment method to bound the FWER while taking into consideration the actual
number of hypotheses that one needs to test. The dataset is resampled (using the method from [17]) in order to adjust
the p-values in such a way that the FWER is within the desired bounds. This can be done in the setting of [23] because
the considered null hypothesis for each itemset is that the items in it appear independently from each other in the
transactions. As we already argued, it is not clear that, in the case of RFI's, it is possible to derive a procedure to
resample the dataset while guaranteeing that the generated datasets come from the null distribution.

Liu et al. [32] conduct an experimental evaluation of methods to control the false positives that are based on direct
corrections, holdout data, and random permutations. They test the methods on a very specific problem (association
rules for binary classification).

In contrast with the methods presented in these works, ours does not need to employ any kind of correction for
multiple hypothesis testing, and uses the entire available data to obtain more accurate results, without the need to
resample it to generate random datasets.

Jacquemont et al. [26] present lower bounds to the size of the dataset needed to simultaneously guarantee that
both the probability of Type-1 and of Type-2 errors are within desired limits. The analysis presented only considers a
single item, so one should apply multiple hypothesis correction in order to achieve the desired FWER.

Teytaud and Lallich [39] suggest the use of VC-dimension to bound the risk of accepting spurious rules extracted
from a database. Although referring to them as "association rules", the rules they focus on involve ranges over domains
and conjunctions of Boolean formulas to express subsets of interest. This is different from the transactional market
basket analysis setting in our work.

2.1 The Vapnik-Chervonenkis dimension

The Vapnik-Chervonenkis dimension was introduced in a seminal work on the sample complexity needed to approximately
learn a classifier from a given family [41]. It has enjoyed widespread success and is used in very
different fields spanning across computer science: learning [5], computational geometry [2, 11, 34], algorithms on
graphs [1, 29], database management [37], and frequent itemsets and association rules mining [36]. Over the years,
the bound to the sample complexity has been improved, and the current best known bound is by Li et al. [31] (see
also [24]). Bounds on the sample complexity involving the empirical VC-dimension were introduced by Bartlett et al.
[3]. Boucheron et al. [7] present a good survey of the field with many recent advances.

3 Preliminaries 

In this section we introduce with the right level of formalism all the necessary definitions, lemmas, and tools that we 
will use throughout the work. 

3.1 Itemsets mining 

Given a ground set $\mathcal{I}$ of items, let $p$ be a probability distribution on $2^{\mathcal{I}}$. We call a single sample drawn from $p$ a
transaction. Given a transaction $\tau \in 2^{\mathcal{I}}$, let $|\tau|$ denote the number of items in $\tau$. A dataset $\mathcal{D}$ is a bag of $n$ transactions
$\mathcal{D} = \{\tau_1, \dots, \tau_n : \tau_i \subseteq \mathcal{I}\}$, i.e., of $n$ independent identically distributed (i.i.d.) samples from $p$. Starting from $p$, we
can define another function on the members of $2^{\mathcal{I}}$:

$$r_p(A) = \sum_{\tau \in 2^{\mathcal{I}},\, A \subseteq \tau} p(\tau).$$

Overloading the namespace a little, we say that $r_p$ is a function on itemsets, which are just members of $2^{\mathcal{I}}$. The
difference between itemsets and transactions is that a transaction $\tau$ contains multiple itemsets $A_1, A_2, \dots, A_{2^{|\tau|}-1}$ (all
of $\tau$'s non-empty subsets are itemsets). Given an itemset $A$ we call $r_p(A)$ the real frequency of $A$.

Define $T(A) = \{\tau \in 2^{\mathcal{I}} : A \subseteq \tau\}$. The members of the set $T(A)$ are all the transactions built on $\mathcal{I}$ that contain
the itemset $A$. Analogously, given a dataset $\mathcal{D}$, let $T_{\mathcal{D}}(A)$ denote the set of transactions in $\mathcal{D}$ that contain $A$. The
frequency of $A$ is the fraction of transactions in $\mathcal{D}$ that contain $A$: $f_{\mathcal{D}}(A) = |T_{\mathcal{D}}(A)| / |\mathcal{D}|$. It is easy to see that $f_{\mathcal{D}}(A)$ is the
empirical average and an unbiased estimator for $r_p(A)$.
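
To make the two notions of frequency concrete, the following Python sketch computes the real frequency of an itemset from an explicitly given distribution and its frequency in a dataset. The distribution `p`, the dataset, and the function names `real_frequency` and `frequency` are hypothetical and chosen only for illustration; in practice the distribution is unknown and only the dataset is available.

```python
from fractions import Fraction


def real_frequency(p, itemset):
    """Real frequency: total probability of the transactions that contain the itemset."""
    a = frozenset(itemset)
    return sum((prob for t, prob in p.items() if a <= t), Fraction(0))


def frequency(dataset, itemset):
    """Frequency in the dataset: fraction of its transactions that contain the itemset."""
    a = frozenset(itemset)
    return Fraction(sum(1 for t in dataset if a <= t), len(dataset))


# Hypothetical distribution over transactions built on the items {a, b, c}.
p = {
    frozenset("a"): Fraction(1, 2),
    frozenset("ab"): Fraction(3, 10),
    frozenset("abc"): Fraction(1, 5),
}
# Hypothetical observed dataset of four transactions.
dataset = [frozenset("a"), frozenset("ab"), frozenset("ab"), frozenset("abc")]

print(real_frequency(p, "ab"))   # real frequency of {a, b}
print(frequency(dataset, "ab"))  # frequency of {a, b} in the dataset
```

Exact fractions are used so that comparisons against a threshold are not affected by floating-point rounding. In this toy instance the observed frequency of {a, b} differs from its real frequency, which is exactly the gap the paper is concerned with.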

In this work, we are interested in finding the itemsets that have real frequency at least $\theta$ for some $\theta \in (0, 1]$. We
call the set of such itemsets the Real Frequent Itemsets (RFI's).

Definition 1. Given a set of items $\mathcal{I}$, a dataset $\mathcal{D}$ of i.i.d. samples from a probability distribution $p$ on the transactions
in $2^{\mathcal{I}}$, and a minimum real frequency threshold $\theta$, $0 < \theta \le 1$, the RFI's mining task with respect to $\theta$ consists of
finding all itemsets with real frequency $\ge \theta$, i.e., the set

$$\mathrm{RFI}(p, \mathcal{I}, \theta) = \{A : A \in 2^{\mathcal{I}} \text{ and } r_p(A) \ge \theta\}.$$

Note that the definition of the RFI's mining task does not depend on the observed dataset $\mathcal{D}$, but only on $p$, $\mathcal{I}$, and $\theta$. It
should be clear that, if one is only given a finite number of random samples (the dataset $\mathcal{D}$) from $p$, as is usually the
case, one cannot aim at finding the exact set $\mathrm{RFI}(p, \mathcal{I}, \theta)$. Indeed, in this work, we will present a method to extract a
collection of itemsets for which we can give probabilistic guarantees that it only includes RFI's.

Traditionally, the interest has been on extracting the set of Frequent Itemsets (FI's) from $\mathcal{D}$, given a minimum
frequency threshold $\theta$.

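Definition 2. Given a dataset $\mathcal{D}$ with transactions built on a ground set $\mathcal{I}$, and a minimum frequency threshold $\theta$,
$0 < \theta \le 1$, the FI's mining task with respect to $\theta$ consists of finding all itemsets with frequency $\ge \theta$, i.e., the set

$$\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta) = \{A : A \in 2^{\mathcal{I}} \text{ and } f_{\mathcal{D}}(A) \ge \theta\}.$$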

Given that $\mathcal{D}$ is a set of independent random samples from $p$, no assumption can be made on the set-inclusion
relationship between $\mathrm{RFI}(p, \mathcal{I}, \theta)$ and $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$, that is, an itemset $A$ in $\mathrm{RFI}(p, \mathcal{I}, \theta)$ may not appear in $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$,
and the opposite may happen as well.
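
The lack of any inclusion relationship between the two collections can be seen on a toy instance. The Python sketch below enumerates all non-empty itemsets by brute force, which is only feasible for a handful of items; the distribution `p`, the two datasets, and the function names `rfi` and `fi` are hypothetical and serve only as an illustration of the two definitions.

```python
from fractions import Fraction
from itertools import combinations


def itemsets(items):
    """All non-empty itemsets built on the given items."""
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]


def rfi(p, items, theta):
    """Itemsets whose real frequency under the distribution p is at least theta."""
    return {a for a in itemsets(items)
            if sum((q for t, q in p.items() if a <= t), Fraction(0)) >= theta}


def fi(dataset, items, theta):
    """Itemsets whose frequency in the dataset is at least theta."""
    return {a for a in itemsets(items)
            if Fraction(sum(1 for t in dataset if a <= t), len(dataset)) >= theta}


items = {"a", "b"}
theta = Fraction(1, 2)
# Hypothetical generating distribution: real frequencies are 7/10, 1/2, and 1/5.
p = {frozenset("a"): Fraction(1, 2), frozenset("b"): Fraction(3, 10), frozenset("ab"): Fraction(1, 5)}

# Two hypothetical samples of four transactions each.
sample_1 = [frozenset("a"), frozenset("a"), frozenset("a"), frozenset("ab")]
sample_2 = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("b")]

print(rfi(p, items, theta))
print(fi(sample_1, items, theta))  # misses the real frequent itemset {b}
print(fi(sample_2, items, theta))  # contains the spurious itemset {a, b}
```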

While previous work has focused on identifying the FI's, in this work we consider the problem of finding the RFI's
instead, as the final goal of market basket analysis is to gain a better understanding of the process generating the data,
i.e., of the real frequency function $r_p$, which is only partially and noisily captured by the dataset $\mathcal{D}$.

We give here one additional definition that we will use in the analysis of our method. 

Definition 3. Given a collection $S$ of itemsets from $2^{\mathcal{I}}$ closed with respect to the set inclusion relation, the negative
border $\mathcal{B}^-(S)$ consists of the minimal itemsets from $2^{\mathcal{I}}$ not in $S$.

The collections of RFI's and of FI's are always closed with respect to set inclusion. In these cases, the negative
border consists of all the non (real) frequent itemsets such that all their subsets are (real) frequent itemsets.
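
As an illustration of Definition 3, the negative border of a small collection can be computed by brute force. The Python sketch below assumes that the input collection is closed under taking non-empty subsets and ignores the empty itemset; under that assumption it is enough to check the subsets obtained by removing a single item. The function names and the example collection are hypothetical.

```python
from itertools import combinations


def nonempty_itemsets(items):
    """All non-empty itemsets built on the given items."""
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]


def negative_border(collection, items):
    """Minimal itemsets not in a subset-closed collection (brute force)."""
    collection = {frozenset(a) for a in collection}
    border = set()
    for a in nonempty_itemsets(items):
        if a in collection:
            continue
        # a is minimal if removing any single item yields a member of the
        # collection (singletons have no non-empty proper subsets).
        if all(len(a) == 1 or (a - {x}) in collection for x in a):
            border.add(a)
    return border


# Hypothetical subset-closed collection on the items {a, b, c}.
collection = ["a", "b", "c", "ab"]
print(negative_border(collection, "abc"))  # the itemsets {a, c} and {b, c}
```

The itemset {a, b, c} is not in the border because one of its subsets, {a, c}, is already outside the collection.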



3.2 VC-dimension 

The Vapnik-Chervonenkis (VC) dimension of a class of subsets defined on a set of points is a measure of the
complexity or expressiveness of such class [41]. A finite bound on the VC-dimension of a structure implies a bound on the
number of random samples required to approximate the expectation of each indicator function associated to a set with
its empirical average. We outline here some basic definitions and results and refer the reader to the works of Devroye
et al. [12, Sect. 12.4], Boucheron et al. [7, Sect. 3], Alon and Spencer [2, Sect. 14.4], and Vapnik [40] for more details
on VC-dimension.

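Let $X_1^k = (X_1, \dots, X_k)$ be a collection of independent identically distributed random variables taking values in
some domain $D$. For a set $A \subseteq D$, let $\nu(A) = \Pr[X_1 \in A]$ and let

$$\nu_{X_1^k}(A) = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}_{X_j \in A},$$

where $\mathbb{1}_{X_j \in A}$ is the indicator variable for $X_j \in A$. Let $\mathcal{F}$ be a collection of subsets from $D$. We call $\mathcal{F}$ a range
set on $D$. Given $B \subseteq D$, the projection of $B$ on $\mathcal{F}$ is the set

$$P_{\mathcal{F}}(B) = \{B \cap A : A \in \mathcal{F}\}.$$

The set $B$ is shattered by $\mathcal{F}$ if $P_{\mathcal{F}}(B) = 2^B$.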

Definition 4. Given a set $B \subseteq D$, the empirical Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ on $B$, denoted as
$\mathrm{EVC}(\mathcal{F}, B)$, is the cardinality of the largest subset of $B$ that is shattered by $\mathcal{F}$.

Definition 5. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is defined as $\mathrm{VC}(\mathcal{F}) = \mathrm{EVC}(\mathcal{F}, D)$.
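
For very small instances, projection, shattering, and the empirical VC-dimension can be checked by exhaustive search. The Python sketch below is exponential in the number of points and is meant only to make the definitions concrete; the function names and the example range set (intervals over three points) are hypothetical. It relies on the fact that every subset of a shattered set is shattered, so the search can stop at the first size for which no subset is shattered.

```python
from itertools import combinations


def projection(ranges, b):
    """Projection of the point set b on the range set: all intersections of b with a range."""
    b = frozenset(b)
    return {b & frozenset(r) for r in ranges}


def is_shattered(ranges, b):
    """b is shattered if its projection contains all 2^|b| subsets of b."""
    return len(projection(ranges, b)) == 2 ** len(b)


def empirical_vc(ranges, points):
    """Size of the largest subset of `points` shattered by `ranges` (brute force)."""
    points = list(points)
    best = 0
    for size in range(1, len(points) + 1):
        if any(is_shattered(ranges, c) for c in combinations(points, size)):
            best = size
        else:
            break  # no shattered set of this size, hence none of a larger size
    return best


# Hypothetical range set: all intervals over the points 1, 2, 3 (plus the empty set).
ranges = [set(), {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}]
print(empirical_vc(ranges, [1, 2, 3]))
```

In this example any two points are shattered, but the three points are not, because no interval contains 1 and 3 without containing 2.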

The main application of (empirical) VC-dimension in statistics and learning theory is in computing the size of the
sample needed to approximate the $\nu(A)$ with their empirical averages $\nu_{X_1^k}(A)$, for all $A \in \mathcal{F}$.

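Definition 6. Given $\varepsilon \in (0, 1)$, an $\varepsilon$-approximation for $\mathcal{F}$ is a set $S \subseteq D$ such that

$$\sup_{A \in \mathcal{F}} |\nu(A) - \nu_S(A)| \le \varepsilon.$$

An $\varepsilon$-approximation can be constructed by randomly sampling points of the point space.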

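Theorem 1. ([24, Thm. 2.12], see also [31]) Let $\mathcal{F}$ be a range set on $D$ with $\mathrm{VC}(\mathcal{F}) \le v$. Given $\delta \in (0, 1)$ and a
positive integer $\ell$, let

$$\varepsilon = \sqrt{\frac{c}{\ell}\left(v + \log\frac{1}{\delta}\right)}, \qquad (1)$$

where $c$ is a universal positive constant. Then, a set of $\ell$ i.i.d. random variables $X_1, \dots, X_\ell$ taking values in $D$ is an
$\varepsilon$-approximation to $\mathcal{F}$ with probability at least $1 - \delta$.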

Löffler and Phillips [33] showed experimentally that the absolute constant $c$ is approximately 0.5.
A similar result holds when an upper bound to the empirical VC-dimension is available.

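Theorem 2. Let $\mathcal{F}$ be a range set on $D$ and $X_1^\ell = (X_1, \dots, X_\ell)$ be a set of i.i.d. random variables taking values in $D$.
Let $v$ be an integer such that $\mathrm{EVC}(\mathcal{F}, X_1^\ell) \le v$. Then, with probability at least $1 - \delta$, $X_1^\ell$ is an $\varepsilon$-approximation for $\mathcal{F}$
for

$$\varepsilon = 2\sqrt{\frac{2 v \log(\ell + 1)}{\ell}} + \sqrt{\frac{2 \log\frac{2}{\delta}}{\ell}}. \qquad (2)$$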
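
The two bounds are easy to evaluate numerically. The Python sketch below assumes the usual forms $\varepsilon = \sqrt{(c/\ell)(v + \ln(1/\delta))}$ for the bound based on the VC-dimension and $\varepsilon = 2\sqrt{2 v \ln(\ell + 1)/\ell} + \sqrt{2 \ln(2/\delta)/\ell}$ for the bound based on the empirical VC-dimension, with natural logarithms and with $c = 0.5$ as suggested by the experimental estimate cited above. The function names and the numerical values of $v$, $\ell$, and $\delta$ are hypothetical.

```python
from math import log, sqrt


def eps_vc(v: int, ell: int, delta: float, c: float = 0.5) -> float:
    """Epsilon from the bound based on an upper bound v to the VC-dimension."""
    return sqrt((c / ell) * (v + log(1 / delta)))


def eps_evc(v: int, ell: int, delta: float) -> float:
    """Epsilon from the bound based on an upper bound v to the empirical VC-dimension."""
    return 2 * sqrt(2 * v * log(ell + 1) / ell) + sqrt(2 * log(2 / delta) / ell)


# Hypothetical values: bound v = 10, failure probability delta = 0.1.
for ell in (10_000, 100_000, 1_000_000):
    print(ell, round(eps_vc(10, ell, 0.1), 4), round(eps_evc(10, ell, 0.1), 4))
```

Both quantities shrink roughly as the inverse square root of the sample size, so a tenfold increase in the number of transactions reduces the offset by a factor of about three.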

3.3 Statistical tests 

To identify the real frequent itemsets, we use the framework of statistical hypothesis testing. In statistical hypothesis
testing, one is given a null hypothesis, whose rejection, based on a measure, or statistic, of the observed data,
corresponds to the identification of a significant phenomenon.

In our scenario, the naive method to employ statistical hypothesis testing to find the real frequent itemsets is to
associate with every itemset $A$ a corresponding null hypothesis $H_A$ = "$r_p(A) < \theta$", and compute the p-value for such
null hypothesis using $f_{\mathcal{D}}(A)$ as statistic. In particular, the p-value is given by the probability that the frequency of $A$ in
a random dataset (with the same number of transactions as $\mathcal{D}$) sampled from $p$ is $\ge f_{\mathcal{D}}(A)$, conditioning on the event
"$r_p(A) < \theta$". If the p-value is small enough, the null hypothesis is rejected, that is, $A$ is flagged as significant, i.e., as a real
frequent itemset. When only one hypothesis is considered, the null hypothesis is rejected when the p-value is $\le \alpha$, with $\alpha$
usually set to 0.05 or 0.01. This guarantees that the significance level, that is, the probability of incorrectly reporting
an itemset as being in $\mathrm{RFI}(p, \mathcal{I}, \theta)$ (type I error), is bounded by $\alpha$.
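
A single test of this kind can be sketched in Python as follows. The sketch assumes that the p-value is evaluated at the boundary of the null hypothesis, i.e., as the probability that a binomial random variable with success probability $\theta$ reaches the observed count; this is the least favourable case because the binomial tail grows with the success probability. The dataset size, threshold, and observed counts are hypothetical, as are the function names.

```python
from math import comb


def binomial_tail(n: int, k: int, q: float) -> float:
    """Return Pr[Bin(n, q) >= k], summing the probability mass function."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


def p_value(n: int, count: int, theta: float) -> float:
    """p-value for the null hypothesis that the real frequency is below theta,
    when the itemset appears in `count` out of n transactions."""
    return binomial_tail(n, count, theta)


n, theta, alpha = 200, 0.02, 0.05  # hypothetical dataset size, threshold, significance level
for count in (6, 10):
    pv = p_value(n, count, theta)
    decision = "reject" if pv <= alpha else "do not reject"
    print(f"count={count}: p-value={pv:.4f}, {decision} the null hypothesis")
```

With these values an itemset observed 10 times out of 200 is flagged, while one observed 6 times is not, even though both have a frequency above the 2% threshold in the dataset.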

In our scenario, since we are considering a number of itemsets, we are facing a multiple hypothesis testing problem.
In this case, if we use the same procedure as in the single hypothesis case, we could potentially produce a number of
spurious discoveries [32]. In order to avoid this issue, a multiple hypothesis correction is employed, like the Bonferroni
correction [13] (see Section 5.3).

Definition 7. The Family-Wise Error Rate (FWER) of a statistical test is the probability that at least one false positive 
is reported as significant. 
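
The Bonferroni correction mentioned above controls the FWER by comparing each p-value with the significance level divided by the number of tested hypotheses. A minimal Python sketch, with hypothetical p-values and a function name introduced here for illustration:

```python
def bonferroni_reject(p_values, alpha):
    """Reject the i-th null hypothesis only if its p-value is at most alpha / m,
    where m is the number of hypotheses tested."""
    m = len(p_values)
    return [pv <= alpha / m for pv in p_values]


p_values = [0.001, 0.02, 0.04, 0.30]  # hypothetical p-values of four itemsets
alpha = 0.05

print([pv <= alpha for pv in p_values])    # uncorrected decisions
print(bonferroni_reject(p_values, alpha))  # decisions after the correction
```

The corrected threshold shrinks linearly with the number of hypotheses, which is why such corrections lose power when the number of candidate itemsets is very large.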

The Bonferroni correction guarantees that the FWER is bounded by $\alpha$. Sometimes it is useful to relax the
guarantees given by the FWER, for example by considering the False Discovery Rate.

Definition 8. The False Discovery Rate [4] (FDR) is the expected proportion of false positives among all itemsets that 
are reported as significant. 

In this work we focus on identifying a highly reliable collection of high quality itemsets, and therefore aim to
identify real frequent itemsets while bounding the FWER of our method, that is, we bound the probability that the
collection of itemsets reported by our method contains any spurious discovery.

Another important factor for a statistical test is its power, that is, the probability that the test correctly rejects the
null hypothesis when the null hypothesis is false (also defined as $1 - \Pr[\text{type II error}]$).



4 A Statistical Test for RFI's 

In this section we present our algorithm to mine a collection of RFI's while guaranteeing that the FWER is within the
user-specified parameter $\delta$. We use results from statistical learning theory to achieve this goal. We define two range
sets made of subsets of transactions associated to two different collections of itemsets, and we give upper bounds to
their (empirical) VC-dimensions. In Section 4.3 we present our algorithm to mine a collection of RFI's with FWER
at most $\delta$.

4.1 The range set of itemsets

Given a ground set $\mathcal{I}$ of items, let $p$ be a probability distribution on $2^{\mathcal{I}}$, and let $R_{\mathcal{I}} = \{T(A) : A \in 2^{\mathcal{I}}\}$ be a range set
on $2^{\mathcal{I}}$. We call $R_{\mathcal{I}}$ the range set of the itemsets. Given any itemset $A$, it is easy to see that $\nu(T(A)) = r_p(A)$. Let $\mathcal{D}$
be a dataset, seen as a collection of i.i.d. samples from the probability distribution $p$ defined on the transactions built
on $\mathcal{I}$. It should be evident that, for any itemset $A$, we have $\nu_{\mathcal{D}}(T(A)) = f_{\mathcal{D}}(A)$.

Riondato and Upfal [36] gave an upper bound to the empirical VC-dimension of $R_{\mathcal{I}}$ on $\mathcal{D}$. Before presenting the
bound, we need to introduce a characteristic quantity of the dataset $\mathcal{D}$ called the d-index and denoted as $d(\mathcal{D})$.

Definition 9. ([36, Def. 12]) Let $\mathcal{D}$ be a dataset. The d-index $d(\mathcal{D})$ of $\mathcal{D}$ is the maximum integer $d$ such that $\mathcal{D}$ contains
at least $d$ transactions of length at least $d$ such that for any two of these transactions $\tau'$, $\tau''$ we have neither $\tau' \subseteq \tau''$
nor $\tau'' \subseteq \tau'$.

Riondato and Upfal [36] present an efficient algorithm to compute an upper bound to the d-index of a dataset.
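
One simple way to obtain such an upper bound, not necessarily the algorithm of [36], is to drop the requirement that no transaction contains another and to ask only for distinct transactions: transactions none of which contains another are necessarily distinct, so the resulting quantity can only be larger than the d-index. The Python sketch below computes this bound and, for comparison, the exact d-index by exhaustive search on tiny datasets. All function names and example datasets are hypothetical.

```python
from itertools import combinations


def d_index_upper_bound(dataset):
    """Largest q such that the dataset has at least q distinct transactions of length >= q."""
    lengths = sorted((len(t) for t in {frozenset(t) for t in dataset}), reverse=True)
    q = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            q = rank
        else:
            break
    return q


def is_antichain(transactions):
    """True if no transaction in the group is a subset of another one."""
    return all(not (a <= b or b <= a) for a, b in combinations(transactions, 2))


def d_index(dataset):
    """Exact d-index by exhaustive search (exponential, for tiny datasets only)."""
    distinct = {frozenset(t) for t in dataset}
    for d in range(d_index_upper_bound(dataset), 0, -1):
        long_enough = [t for t in distinct if len(t) >= d]
        if any(is_antichain(group) for group in combinations(long_enough, d)):
            return d
    return 0


mixed = ["abc", "abd", "ab", "a", "abc"]  # hypothetical dataset with a repeated transaction
chain = ["a", "ab", "abc"]                # every transaction contains the previous one
print(d_index_upper_bound(mixed), d_index(mixed))
print(d_index_upper_bound(chain), d_index(chain))
```

On the second dataset the bound is not tight: the transactions form a chain, so the d-index is 1 while the bound is 2.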

Theorem 3. ([36, Thm. 3]) Let $\mathcal{D}$ be a dataset with transactions built on a ground set $\mathcal{I}$. Then $\mathrm{EVC}(R_{\mathcal{I}}, \mathcal{D}) \le d(\mathcal{D})$.

It is easy to extend this theorem to prove an upper bound to the VC-dimension of $R_{\mathcal{I}}$.

Theorem 4. Let $\mathcal{D}$ be a dataset with transactions built on a ground set $\mathcal{I}$. Then $\mathrm{VC}(R_{\mathcal{I}}) \le |\mathcal{I}| - 1$.

Proof. Let $k = |\mathcal{I}|$. Assume w.l.o.g. that $\mathcal{I} = \{1, \dots, k\}$. Let $\tau_i = \mathcal{I} \setminus \{i\}$, $1 \le i \le k$. We have $|\tau_i| = k - 1$, for
$1 \le i \le k$. Notice that for $i \ne j$, neither $\tau_i \subseteq \tau_j$ nor $\tau_j \subseteq \tau_i$. Consider a dataset $\mathcal{D}$ made of the set of transactions
$\{\tau_1, \dots, \tau_k\}$ and any number of additional transactions. It is easy to see that $d(\mathcal{D}) = k - 1$, and that it is impossible
to build a dataset with a larger d-index using the items of $\mathcal{I}$, because there is no way to use the items of $\mathcal{I}$ to build
$\ell > k - 1$ transactions of length at least $\ell$ such that for any two of these transactions $\tau'$, $\tau''$ we have neither $\tau' \subseteq \tau''$
nor $\tau'' \subseteq \tau'$. This concludes the proof. □

Riondato and Upfal [36] also showed that the upper bound presented in Thm. 3 is strict, in the sense that there
exist datasets for which $\mathsf{EVC}(R_{\mathcal{I}}, \mathcal{D}) = d(\mathcal{D})$. This implies that the upper bound presented in Thm. 4 is also strict.

The upper bound to the empirical VC-dimension of $R_{\mathcal{I}}$ on $\mathcal{D}$ is used by Riondato and Upfal [36] to compute
the size of a sample $S$ needed to extract a superset of the collection $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ from $S$, where the quality of
the approximation is controlled by user-defined parameters $\varepsilon$ and $\delta$. In this work we will use the two upper bounds
presented above to compute a superset of the negative border of the collection of RFI's with respect to a minimum real
frequency threshold $\theta$.



4.2 The range set of the negative border 

We now define another range set that we will call the range set of the negative border. Given a minimum real frequency
threshold $\theta$, let $\mathcal{B}^-_\theta = \mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$ be the negative border of the RFI's with respect to $\theta$. We define the range set
of the negative border as

$R_{\mathcal{B}^-_\theta} = \{T(A) : A \in \mathcal{B}^-_\theta\}$.

Similar to the case of the range set of the itemsets, we have, for an itemset $A \in \mathcal{B}^-_\theta$, $\nu(T(A)) = r_p(A)$ and, for a
dataset $\mathcal{D}$, $\nu_{\mathcal{D}}(T(A)) = f_{\mathcal{D}}(A)$.
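
To make the notion of negative border concrete, the following brute-force sketch enumerates it for a tiny ground set. It assumes the usual definition (itemsets outside the collection all of whose proper subsets are inside it), that the input collection is downward closed, and that the empty itemset is implicitly part of it; the function name is hypothetical and the enumeration is exponential in the number of items.

```python
from itertools import combinations


def negative_border(collection, items):
    """Itemsets not in `collection` whose proper subsets all are in it.

    Assumes `collection` is downward closed, so checking the subsets
    obtained by removing a single item is sufficient.
    """
    coll = {frozenset(c) for c in collection}
    items = list(items)
    border = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if frozenset(cand) in coll:
                continue
            # The empty itemset (k = 1 case) is treated as always present.
            subsets = (s for s in combinations(cand, k - 1) if s)
            if all(frozenset(s) in coll for s in subsets):
                border.add(frozenset(cand))
    return border
```

With items {1, 2, 3} and the collection {{1}, {2}, {3}, {1, 2}}, the border consists of {1, 3} and {2, 3}; the itemset {1, 2, 3} is excluded because {1, 3} is not in the collection.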

The intuition behind the definition of this range space is the following. Assume that we can compute an upper
bound to the (empirical) VC-dimension of this range space. Then, given a dataset $\mathcal{D}$, we can apply Theorem 1 or
Theorem 2 to compute an offset parameter $\varepsilon$ such that $\mathcal{D}$ is an $\varepsilon$-approximation to $R_{\mathcal{B}^-_\theta}$ with probability at least $1 - \delta$
for some user-specified parameter $\delta$. Now, if $\mathcal{D}$ is indeed an $\varepsilon$-approximation to $R_{\mathcal{B}^-_\theta}$, then no itemset $A \in \mathcal{B}^-_\theta$ can have
a frequency $f_{\mathcal{D}}(A)$ in $\mathcal{D}$ greater than $\theta + \varepsilon$. We will prove this formally in Sec. 4.3 and use this fact as the core step in
the analysis of the correctness of our method to compute a collection of RFI's with FWER at most $\delta$.
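
The claim about the frequency of the itemsets in the negative border can be checked in one line. The sketch below assumes the standard form of the $\varepsilon$-approximation property, $|f_{\mathcal{D}}(A) - r_p(A)| \le \varepsilon$ for every $A$ with $T(A)$ in the range set, and uses only the fact that an itemset in $\mathcal{B}^-_\theta$ is not an RFI, i.e., $r_p(A) < \theta$:

```latex
\[
  f_{\mathcal{D}}(A)
  \;\le\; r_p(A) + \bigl| f_{\mathcal{D}}(A) - r_p(A) \bigr|
  \;\le\; r_p(A) + \varepsilon
  \;<\; \theta + \varepsilon
  \qquad \text{for every } A \in \mathcal{B}^-_\theta .
\]
```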

4.2.1 An upper bound to the VC-dimension of $R_{\mathcal{B}^-_\theta}$

It is easy to see that $R_{\mathcal{B}^-_\theta}$ is a subset of $R_{\mathcal{I}}$ for any $\theta \in [0, 1]$, so we will have $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D}) \le \mathsf{EVC}(R_{\mathcal{I}}, \mathcal{D})$ for any $\mathcal{D}$
and, consequently, $\mathsf{VC}(R_{\mathcal{B}^-_\theta}) \le \mathsf{VC}(R_{\mathcal{I}})$. This means that one could apply Theorems 3 and 4 to bound $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D})$
and $\mathsf{VC}(R_{\mathcal{B}^-_\theta})$, but the resulting bounds can be extremely loose. In this section, we show a different way to compute
upper bounds to these quantities. Our experimental results show that the approach we present here is practical and
gives much tighter bounds.

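Lemma 1. Let $Q$ be a set of transactions and let $v$ be the maximum integer for which there are at least $v$ transactions
$\tau_1, \dots, \tau_v$ in $Q$ such that:

1. For any pair of transactions $(\tau_i, \tau_j)$, $i \ne j$, we have $\tau_i \not\subseteq \tau_j$ and $\tau_j \not\subseteq \tau_i$, and

2. each $\tau_i$, $1 \le i \le v$, contains at least $2^{v-1}$ itemsets from $\mathcal{B}^-_\theta$.

Then $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, Q) \le v$.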

Proof. The first requirement guarantees that the set of transactions considered in the computation of $v$ could indeed
theoretically be shattered. Assume that a subset $F$ of $Q$ contains two transactions $\tau'$ and $\tau''$ such that w.l.o.g. $\tau' \subseteq \tau''$.
Any itemset from $\mathcal{B}^-_\theta$ appearing in $\tau'$ would also appear in $\tau''$, so there would not be any itemset $A \in \mathcal{B}^-_\theta$ such that
$\tau' \in T(A) \cap F$ but $\tau'' \notin T(A) \cap F$, which would imply that $F$ cannot be shattered. Hence sets that do not respect
requirement 1 should not be considered. This has the net effect of potentially resulting in a lower $v$, i.e., in a stricter
upper bound to $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, Q)$.

Let $\ell > v$ and consider a set $L$ of $\ell$ transactions from $Q$ such that requirement 1 holds. Assume that $L$ is shattered
by $R_{\mathcal{B}^-_\theta}$. Let $\tau$ be a transaction in $L$. The transaction $\tau$ belongs to $2^{\ell-1}$ subsets of $L$. Let $K \subseteq L$ be one of these
subsets. Since $L$ is shattered, there exists an itemset $A \in \mathcal{B}^-_\theta$ such that $T(A) \cap L = K$. From this and the fact that
$\tau \in K$, we have that $\tau \in T(A)$ or, equivalently, that $A \subseteq \tau$. Given that all the subsets $K$ containing $\tau$ are different,
all the $T(A)$ such that $T(A) \cap L = K$ must also be different, which in turn implies that all the itemsets $A \in \mathcal{B}^-_\theta$
must be different and that they must all appear in $\tau$. There are $2^{\ell-1}$ subsets $K$ containing $\tau$, so $\tau$ must contain
at least $2^{\ell-1}$ itemsets from $\mathcal{B}^-_\theta$, and this holds for all $\ell$ transactions in $L$. This is a contradiction because $\ell > v$ and $v$
is the maximum integer for which there are at least $v$ transactions containing at least $2^{v-1}$ itemsets from $\mathcal{B}^-_\theta$. Hence
$L$ cannot be shattered and the thesis follows. □

An exact computation of $v$ as defined in Lemma 1 could be extremely expensive, as it would require scanning the
transactions one by one and computing the number of itemsets from $\mathcal{B}^-_\theta$ appearing in each transaction. Moreover, the
negative border $\mathcal{B}^-_\theta$ is not known in advance, so such a computation would not be possible.

Let us still assume that we know $\mathcal{B}^-_\theta$. We now present an algorithm to compute an upper bound to $v$. The algorithm will
only need minor modifications when the assumption of knowing $\mathcal{B}^-_\theta$ is dropped.

The Set-Union Knapsack Problem. We first introduce the Set-Union Knapsack Problem (SUKP).

Definition 10 ([19]). Let $U = \{a_1, \dots, a_\ell\}$ be a set of elements and let $S = \{A_1, \dots, A_k\}$ be a set of subsets of $U$,
i.e., $A_i \subseteq U$ for $1 \le i \le k$. Each subset $A_i$, $1 \le i \le k$, has an associated non-negative profit $\rho(A_i) \in \mathbb{R}^+$, and each
element $a_j$, $1 \le j \le \ell$, has an associated non-negative weight $w(a_j) \in \mathbb{R}^+$. Given a subset $S' \subseteq S$, we define the profit
of $S'$ as $P(S') = \sum_{A_i \in S'} \rho(A_i)$. Let $U_{S'} = \bigcup_{A_i \in S'} A_i$. We define the weight of $S'$ as $W(S') = \sum_{a_j \in U_{S'}} w(a_j)$.

Given a non-negative parameter $c$ that we call capacity, the Set-Union Knapsack Problem (SUKP) requires to
find the set $S^* \subseteq S$ which maximizes $P(S')$ over all sets $S'$ for which $W(S') \le c$.
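
The definition can be turned directly into an exhaustive search. The sketch below is only meant to make the objective and the capacity constraint concrete on toy instances: it enumerates all subsets of $S$, so it is exponential and is not a substitute for an optimization solver. The function name and the default unit profits and weights are choices made for this illustration.

```python
from itertools import combinations


def solve_sukp_bruteforce(subsets, capacity, profit=None, weight=None):
    """Optimal profit P(S*) of a SUKP instance, by exhaustive search."""
    profit = profit or (lambda subset: 1)   # unit profits by default
    weight = weight or (lambda element: 1)  # unit weights by default
    subsets = [frozenset(a) for a in subsets]
    best = 0  # the empty selection is always feasible
    for r in range(1, len(subsets) + 1):
        for chosen in combinations(subsets, r):
            union = frozenset().union(*chosen)
            # W(S') is the total weight of the union of the chosen subsets.
            if sum(weight(a) for a in union) <= capacity:
                best = max(best, sum(profit(a) for a in chosen))
    return best
```

With unit profits and weights, the subsets {1}, {2}, {1, 2}, {2, 3}, {3} and capacity 2, the optimum is 3 (for example {1}, {2} and {1, 2}, whose union has two elements).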

The SUKP is NP-hard in the general case, but there are known restrictions for which it can be solved in polynomial
time using dynamic programming [19]. It is not clear whether our case falls into one of these restrictions, but we found
that available optimization problem solvers can compute the optimal solution reasonably fast even for very large
instances with thousands of elements and tens of thousands of subsets.

In our case, $U = \mathcal{I}$, $S = \mathcal{B}^-_\theta$, $w(a_j) = 1$ for all $a_j \in \mathcal{I}$, and $\rho(A) = 1$ for all $A \in \mathcal{B}^-_\theta$. The capacity $c$ will be a number
denoting a transaction length. It should be clear that the profit $P(S^*)$ of the optimal solution $S^*$ to this problem will
denote the maximum number of itemsets from $\mathcal{B}^-_\theta$ that can appear in a transaction of length $c$. Then, the SUKP can be
formulated as a Binary Integer Program as follows.

Constant: Non-negative integer capacity $c$.

Variables:

• Integer indicator variable $X_A$ for each $A \in \mathcal{B}^-_\theta$, taking values in $\{0, 1\}$.

• Integer indicator variable $Y_a$ for each $a \in \mathcal{I}$, taking values in $\{0, 1\}$.

Objective: maximize the function $\sum_{A \in \mathcal{B}^-_\theta} X_A$.

Constraints:

• $\sum_{a \in \mathcal{I}} Y_a \le c$.

• $Y_a - X_A \ge 0$ for each itemset $A \in \mathcal{B}^-_\theta$ and for each item $a \in A$.

Computing the upper bound. Given a dataset $\mathcal{D}$, let now $\ell_1, \dots, \ell_w$ be the sequence of the transaction lengths of
transactions in $\mathcal{D}$, i.e., for each value $\ell$ for which there is at least one transaction in $\mathcal{D}$ of length $\ell$, there is one (and
only one) index $i$, $1 \le i \le w$, such that $\ell_i = \ell$. Assume that the $\ell_i$'s are labelled in sorted decreasing order:
$\ell_1 > \ell_2 > \dots > \ell_w$. Let now $L_i$, $1 \le i \le w$, be the maximum number of transactions in $\mathcal{D}$ that have length at least
$\ell_i$ and such that for no two $\tau'$, $\tau''$ of them we have either $\tau' \subseteq \tau''$ or $\tau'' \subseteq \tau'$. The sequence $(\ell_i)_1^w$ and a sequence
$(L_i^*)_1^w$ of upper bounds to $(L_i)_1^w$ can be computed efficiently with a scan of the dataset.
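
The text does not spell out how the upper bounds $L_i^*$ are obtained. One simple possibility, sketched below under that caveat, is to count the distinct transactions of length at least $\ell_i$: dropping the containment requirement (except for exact duplicates, which trivially contain each other) can only increase the count, so the result is an upper bound to $L_i$. The function name is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def length_profile(transactions):
    """Return (lengths, upper) with lengths sorted decreasingly and
    upper[i] = number of distinct transactions of length >= lengths[i]."""
    distinct = {frozenset(t) for t in transactions}
    counts = Counter(len(t) for t in distinct)
    lengths = sorted(counts, reverse=True)
    # Running total over decreasing lengths gives the counts "at least l_i".
    upper = list(accumulate(counts[length] for length in lengths))
    return lengths, upper
```

For the transactions {1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3}, {4} the function returns the lengths [3, 2, 1] and the bounds [1, 3, 4]. The true value of $L_2$ here is 2, since {1, 2} and {2, 3} are both contained in {1, 2, 3}, which shows that the bound is not always tight.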

Algorithm 1 presents the pseudocode of our method to compute an upper bound to $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D})$. The solve_SUKP($U$, $S$, $c$)
routine computes the optimal objective value for the SUKP problem defined on the set of elements $U$, the set of subsets
$S$, with capacity $c$ and unitary profits and weights.



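Algorithm 1: computes an upper bound to $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D})$

Input : a dataset $\mathcal{D}$ with transactions built on a ground set $\mathcal{I}$, a minimum probability mass threshold $\theta$.
Output: the empirical b-index of $\mathcal{D}$ with respect to $\theta$

for $i \leftarrow 1$ to $w$ do
    $q \leftarrow$ solve_SUKP($\mathcal{I}$, $\mathcal{B}^-_\theta$, $\ell_i$)
    $b \leftarrow \lfloor \log_2 q \rfloor + 1$
    if $L_i^* \ge b$ then
        return $b$
    end
end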



We can formulate the following lemma, whose correctness easily derives from the discussion above.

Lemma 2. Let $b$ be the value returned by Algorithm 1. Then $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D}) \le b$.

Consider now the SUKP defined on $\mathcal{I}$ and $\mathcal{B}^-_\theta$ with unitary weights and profits and with capacity $c = |\mathcal{I}| - 1$. Let
$q^*$ be its optimal objective value and let $b^* = \lfloor \log_2 q^* \rfloor + 1$. The following lemma is an easy consequence of this definition
and of Lemma 2.

Lemma 3. $\mathsf{VC}(R_{\mathcal{B}^-_\theta}) \le b^*$.

Removing the assumption on $\mathcal{B}^-_\theta$. Until now we assumed to know $\mathcal{B}^-_\theta$. This assumption is not realizable in practice:
if we knew the negative border exactly, we would also know the collection of RFI's, as each uniquely identifies the
other. We now remove this assumption and present a slight modification of the method presented in the
previous section to compute an upper bound to the (empirical) VC-dimension of $R_{\mathcal{B}^-_\theta}$.

Let $B$ be a superset of $\mathcal{B}^-_\theta$. It should be evident that if we replace $\mathcal{B}^-_\theta$ with $B$ in the SUKP problems described
above, then the optimal objective function values of these modified problems will not be smaller than the optimal
objective function values of the original problems, given that the original optimal solutions are feasible solutions of the
modified problems.

Assume that we can compute a superset $B$ of $\mathcal{B}^-_\theta$ (we will show how to do it in Section 4.3). One drawback with
the approach we just outlined would be that the resulting bound to the (empirical) VC-dimension could be much larger
than the original bound. This is due to the fact that we are not enforcing an additional constraint on the desired structure
of the optimal solution that is implicit when only considering $\mathcal{B}^-_\theta$: the fact that the negative border is an antichain.
Given a universe $U$, an antichain is a set $\mathcal{T}$ of subsets of $U$ such that for no two distinct subsets $X', X'' \in \mathcal{T}$ we have
$X' \subseteq X''$ or $X'' \subseteq X'$.

We modify the formulation of the SUKP to restrict the set of feasible solutions to only contain antichains. As in
the original formulation of the SUKP, let $U$ and $S$ be a set of elements and a set of subsets of $U$, respectively. A chain
on the members of $S$ is a subset $\{A_1, A_2, \dots, A_w\} \subseteq S$ such that $A_1 \subset A_2 \subset \dots \subset A_w$. A maximal chain on the
members of $S$ is a chain to which it is impossible to add any other member of $S$ without destroying the chain property.
Let $\mathcal{C}_S$ denote the set of all the maximal chains on the members of $S$. We modify the definition of the SUKP defined
on $U$, $S$ by including the following set of constraints:

• $\forall C \in \mathcal{C}_S$: $\sum_{A \in C} X_A \le 1$.

We call this modified version of the SUKP the Antichain SUKP (ASUKP).
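
For toy instances the ASUKP with unit weights and profits can again be solved by exhaustive search. The sketch below replaces the maximal-chain constraints with an equivalent pairwise test (a selection contains at most one member of every maximal chain exactly when no selected subset properly contains another); the function names are hypothetical. It also shows that $\lfloor \log_2 q \rfloor + 1$ equals the bit length of $q$ for $q \ge 1$; returning 0 for $q = 0$ is a convention assumed here, since the logarithm is undefined in that case.

```python
from itertools import combinations


def solve_asukp_bruteforce(candidates, capacity):
    """Largest antichain among `candidates` whose union has at most
    `capacity` elements (unit weights and profits)."""
    cands = list({frozenset(c) for c in candidates})
    best = 0
    for r in range(1, len(cands) + 1):
        for chosen in combinations(cands, r):
            if len(frozenset().union(*chosen)) > capacity:
                continue
            # `a < b` is the proper-subset test for frozensets.
            if any(a < b or b < a for a, b in combinations(chosen, 2)):
                continue
            best = max(best, r)
    return best


def b_from_q(q):
    """floor(log2(q)) + 1 for q >= 1; 0 for q == 0 (assumed convention)."""
    return q.bit_length()
```

On the candidates {1}, {2}, {1, 2}, {2, 3}, {3} with capacity 2, the antichain restriction lowers the optimum from 3 (plain SUKP) to 2, which is the kind of tightening the ASUKP is introduced for.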

Let $B$ be a superset of $\mathcal{B}^-_\theta$. We modify Algorithm 1 by replacing the call to solve_SUKP($\mathcal{I}$, $\mathcal{B}^-_\theta$, $\ell_i$) with
a call to the routine solve_ASUKP($\mathcal{I}$, $B$, $\ell_i$), which computes the optimal value of the objective function for the
ASUKP defined on $\mathcal{I}$ and $B$, with capacity $\ell_i$ and unitary weights and profits. It should be evident that this value
is not smaller than the value one would obtain using the original call to solve_SUKP($\mathcal{I}$, $\mathcal{B}^-_\theta$, $\ell_i$) because the
optimal solution to the problem corresponding to this call is still a feasible solution for the problem corresponding to
solve_ASUKP($\mathcal{I}$, $B$, $\ell_i$).

Definition 11. Let $b$ be the value returned by the modified Algorithm 1. We call $b$ the empirical b-index of $\mathcal{B}^-_\theta$ on
$\mathcal{D}$ using $B$ and denote it with $eb(\mathcal{B}^-_\theta, \mathcal{D}, B)$.

The following lemma then follows easily from the above discussion and Lemma 2.

Lemma 4. $\mathsf{EVC}(R_{\mathcal{B}^-_\theta}, \mathcal{D}) \le eb(\mathcal{B}^-_\theta, \mathcal{D}, B)$.

We now proceed to bound $\mathsf{VC}(R_{\mathcal{B}^-_\theta})$ in a way similar to what we did when we assumed to know $\mathcal{B}^-_\theta$. Let $q^*$ be the
optimal value of the objective function for the ASUKP defined on $\mathcal{I}$ and $B$ with capacity $|\mathcal{I}| - 1$ and unitary weights
and profits. Let $b^* = \lfloor \log_2 q^* \rfloor + 1$.

Definition 12. We call $b^*$ the b-index of $\mathcal{B}^-_\theta$ using $B$ and denote it with $b(\mathcal{B}^-_\theta, B)$.

We can now conclude with the following lemma, which is a direct consequence of the above discussion and
Lemma 3.

Lemma 5. $\mathsf{VC}(R_{\mathcal{B}^-_\theta}) \le b(\mathcal{B}^-_\theta, B)$.

4.3 The statistical test 

In this section we present our method to compute a collection of RFI's while controlling its FWER. We then prove its
correctness using the concepts and results defined in the previous section.

The intuition behind our statistical test is extremely simple: we first use the results from Section 4.1 to compute a
lowered frequency threshold $\theta - \varepsilon'$ such that the collection of FI's at that threshold is a superset of the collection of
RFI's. We then use this collection as the basis to compute a superset of the negative border of the collection of RFI's.
From here, using the results from Section 4.2.1, we proceed to compute upper bounds to the (empirical) VC-dimension
of the range set of the negative border. Finally, we compute another offset parameter $\varepsilon''$ so that the FI's with frequency
in the dataset at least $\theta + \varepsilon''$ are all RFI's, with probability at least $1 - \delta$. The pseudocode for the statistical test is
presented in Algorithm 2.



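Algorithm 2: A statistical test for RFI's

Input : a dataset $\mathcal{D}$, a minimum probability mass threshold $\theta$, a FWER bound $\delta$.
Output: a collection of itemsets $A$ with $r_p(A) \ge \theta$ such that FWER $\le \delta$.

 1  $\delta', \delta'' \leftarrow 1 - \sqrt{1 - \delta}$
 2  $d_1 \leftarrow |\mathcal{I}| - 1$
 3  $d_2 \leftarrow$ get_d-index($\mathcal{D}$)
 4  $\varepsilon'_1 \leftarrow$ get_epsilon_VC($d_1$, $|\mathcal{D}|$, $\delta'$)
 5  $\varepsilon'_2 \leftarrow$ get_epsilon_empVC($d_2$, $|\mathcal{D}|$, $\delta'$)
 6  $\varepsilon' \leftarrow \min\{\varepsilon'_1, \varepsilon'_2\}$
 7  $X \leftarrow \mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta - \varepsilon')$
 8  $Z \leftarrow (X \setminus \mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta + \varepsilon')) \cup \mathcal{B}^-(X)$
 9  $b_1 \leftarrow$ get_b-index($|\mathcal{I}| - 1$, $Z$)
10  $b_2 \leftarrow$ get_emp_b-index($\mathcal{I}$, $\mathcal{D}$, $Z$)
11  $\varepsilon''_1 \leftarrow$ get_epsilon_VC($b_1$, $|\mathcal{D}|$, $\delta''$)
12  $\varepsilon''_2 \leftarrow$ get_epsilon_empVC($b_2$, $|\mathcal{D}|$, $\delta''$)
13  $\varepsilon'' \leftarrow \min\{\varepsilon''_1, \varepsilon''_2\}$
14  return $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta + \varepsilon'')$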



The routine get_epsilon_VC in line 4 (resp. get_epsilon_empVC in line 5) takes as input an upper
bound $d^*$ to the VC-dimension of $R_{\mathcal{I}}$ (resp. to the empirical VC-dimension of $R_{\mathcal{I}}$ on $\mathcal{D}$), a dataset size $|\mathcal{D}|$, and a
real $\delta^* \in (0, 1)$, and uses (1) (resp. (2)) to compute a value $\varepsilon^*$ such that a dataset of size $|\mathcal{D}|$ is an $\varepsilon^*$-approximation to
a range set with VC-dimension (resp. with empirical VC-dimension) at most $d^*$ with probability at least $1 - \delta^*$.
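
Equations (1) and (2) are stated together with Theorems 1 and 2 and are not reproduced in this section. As a minimal sketch, assume they take the forms $\varepsilon = \sqrt{\frac{c}{n}\left(d + \ln\frac{1}{\delta}\right)}$ with $c = 0.5$ and $\varepsilon = 2\sqrt{\frac{2 d \ln(n+1)}{n}} + \sqrt{\frac{2 \ln(2/\delta)}{n}}$, where $n$ is the dataset size. Both the forms and the constant are assumptions of this illustration, as are the function names; with $n$ equal to 20 million and $\delta = 0.1$ they give values that agree with the accidents row of Table 1 to the reported precision.

```python
import math


def split_delta(delta):
    """Line 1 of Algorithm 2: delta' = delta'' = 1 - sqrt(1 - delta)."""
    return 1.0 - math.sqrt(1.0 - delta)


def get_epsilon_vc(d, n, delta, c=0.5):
    """Assumed form of (1): offset from a bound d on the VC-dimension."""
    return math.sqrt((c / n) * (d + math.log(1.0 / delta)))


def get_epsilon_emp_vc(d, n, delta):
    """Assumed form of (2): offset from a bound d on the empirical VC-dimension."""
    first = 2.0 * math.sqrt(2.0 * d * math.log(n + 1) / n)
    second = math.sqrt(2.0 * math.log(2.0 / delta) / n)
    return first + second
```

Under these assumptions, the second form decreases more slowly with $n$ for a given $d$ because of the $\ln(n+1)$ factor, which is in line with the remark in Section 5.3 that the relative order of the two candidates can change with the dataset size.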

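The routine get_d-index computes (an upper bound to) the d-index of a dataset, as described
in [36]. The routine get_b-index (resp. get_emp_b-index) computes $b(\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)), Z^*)$
(resp. $eb(\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)), \mathcal{D}, Z^*)$), the b-index of $\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$ (resp. the empirical b-index of $\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$
on $\mathcal{D}$) using a superset $Z^*$ of $\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$. We described these computations in Section 4.2.1.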



Lemma 6. With probability at least $1 - \delta'$, $\mathcal{D}$ is an $\varepsilon'$-approximation to $R_{\mathcal{I}}$.

Proof. Suppose $\varepsilon' = \varepsilon'_1$. From Theorem 4 we know that $\mathsf{VC}(R_{\mathcal{I}}) \le d_1$ $(= |\mathcal{I}| - 1)$. The thesis follows from the
definition of $\varepsilon'_1$ (through the routine get_epsilon_VC(·)) and Theorem 1.

Suppose instead that $\varepsilon' = \varepsilon'_2$. From Theorem 3 we know that $\mathsf{EVC}(R_{\mathcal{I}}, \mathcal{D}) \le d_2$, given that $d_2$ is (an upper bound
to) the d-index of $\mathcal{D}$. The thesis follows from the definition of $\varepsilon'_2$ (through the routine get_epsilon_empVC(·))
and Theorem 2. □

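Lemma 7. If $\mathcal{D}$ is an $\varepsilon'$-approximation to $R_{\mathcal{I}}$, then $\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)) \subseteq Z$, where $Z$ is defined as in line 8 of
Algorithm 2.

Proof. From the hypothesis we have that, for every itemset $A \subseteq \mathcal{I}$, $f_{\mathcal{D}}(A) \in [r_p(A) - \varepsilon', r_p(A) + \varepsilon']$. Consider
$B \in \mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$. This by definition implies $r_p(B) < \theta$.

If $f_{\mathcal{D}}(B) \ge \theta - \varepsilon'$, then, since $r_p(B) < \theta$, we have $f_{\mathcal{D}}(B) < \theta + \varepsilon'$, that is, $B \in Z$.

Assume instead that $f_{\mathcal{D}}(B) < \theta - \varepsilon'$. Since $B \in \mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))$, for all $C \in 2^B \setminus \{B\}$ we have
$r_p(C) \ge \theta$, therefore for all $C \in 2^B \setminus \{B\}$ we have $f_{\mathcal{D}}(C) \ge \theta - \varepsilon'$, which implies $B \in \mathcal{B}^-(\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta - \varepsilon'))$. Since
$\mathcal{B}^-(\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta - \varepsilon')) \subseteq Z$, then $B \in Z$. □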

Lemma 8. Assume that $\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)) \subseteq Z$; then $\mathsf{VC}(R_{\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))}) \le b_1$ and $\mathsf{EVC}(R_{\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))}, \mathcal{D}) \le b_2$.

Proof. The result follows from Lemma 5 and Lemma 4. □



Lemma 9. With probability at least $1 - \delta$, $\mathcal{D}$ is an $\varepsilon''$-approximation to $R_{\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta))}$.

Proof. Assume that $\mathcal{D}$ is an $\varepsilon'$-approximation to $R_{\mathcal{I}}$. From Lemma 6 we know that this happens with probability at
least $1 - \delta'$. Then, Lemma 7 holds, and therefore Lemma 8 also holds. Assume now that $\varepsilon'' = \varepsilon''_1$. Then the thesis
follows from the definitions of $\varepsilon''_1$ and of $\delta''$, and from Theorem 1. If instead $\varepsilon'' = \varepsilon''_2$, the thesis follows from the
definitions of $\varepsilon''_2$ and of $\delta''$ and from Theorem 2. □
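
The role of the choice of $\delta'$ and $\delta''$ in line 1 of the algorithm can be made explicit with a short computation. Reading the proof as two stages that succeed with probability at least $1 - \delta'$ and $1 - \delta''$ respectively, and whose guarantees are multiplied (this reading is a sketch of the bookkeeping, not a replacement for the formal argument), the overall confidence is:

```latex
\[
  (1 - \delta')(1 - \delta'')
  = \bigl(1 - (1 - \sqrt{1 - \delta})\bigr)^2
  = \bigl(\sqrt{1 - \delta}\bigr)^2
  = 1 - \delta .
\]
```

For $\delta = 0.1$ this gives $\delta' = \delta'' \approx 0.0513$.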

We conclude the analysis of the correctness of our method with a corollary which follows from the property of an
$\varepsilon''$-approximation (Def. 6).

Corollary 1. The FWER of the statistical test implemented by Algorithm 2 is at most $\delta$.



5 Experimental evaluation 

We implemented our algorithm and conducted an extensive evaluation to assess its practical applicability and its 
statistical power. In the following sections we describe the methodology and present the results. 



5.1 Implementation 

We implemented Algorithm 2 in Python 3.3, except for the mining algorithm and the optimization problem solver.
Our statistical test is agnostic to the choice of mining algorithm used to extract the collection of FI's at frequency
$\theta - \varepsilon'$ (line 7 of Algorithm 2). We used the implementation by Grahne and Zhu [20] (written in C) available from the
FIMI'03 implementation repository.¹ Our solver of choice for the ASUKP's to compute the b-indexes (lines 9 and
10 of Algorithm 2) was IBM® ILOG® CPLEX® Optimization Studio 12.3, called through the Python 2.7 API. We
executed our experiments on a number of machines with x86-64 processors running GNU/Linux 2.6.32.



5.2 Datasets generation 

We evaluated our statistical test using five different datasets (accidents, BMS-POS, kosarak, pumsb*, and
retail) from the FIMI repository.² These datasets have different characteristics and different distributions of the
frequencies of the itemsets [18].

For currently available datasets the ground truth is not known, that is, we do not know the distribution $p$, and hence
neither the real frequencies $r_p(\cdot)$ of the itemsets, which we need to evaluate the performance of our statistical test.
Therefore, to have large datasets and the corresponding probabilities $r_p(\cdot)$, we created new datasets starting from the
ones from the FIMI repository by sampling transactions uniformly at random until a desired size (in our case 20 million
transactions) has been reached. This way, the ground truth is given by the frequencies of the itemsets in the original
datasets: if the frequency of itemset $X$ in the original dataset was $f(X)$, then we assume $r_p(X) = f(X)$. Notice that
the expected frequency of an itemset in the enlarged dataset is $r_p(X)$, independently of the distribution. Given that
our method to control the FWER is distribution-independent, this is a valid way to establish a ground truth.
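
The enlargement procedure can be sketched in a few lines. The code below assumes that sampling is done with replacement (which is what makes the original frequencies the expected frequencies in the enlarged dataset) and that transactions are represented as frozensets; the function names and the seed parameter are introduced only for this illustration.

```python
import random


def enlarge_dataset(original, size, seed=0):
    """Draw `size` transactions uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(original, k=size)


def frequency(itemset, dataset):
    """Fraction of transactions of `dataset` that contain `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= t) / len(dataset)
```

Here `frequency(X, original)` plays the role of the ground truth $r_p(X)$, while `frequency(X, enlarged)` is the quantity $f_{\mathcal{D}}(X)$ observed by the statistical test.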



5.3 Parameters evaluation 

Firstly, we evaluated the most important parameters used by our statistical test: the upper bound $d_1 = |\mathcal{I}| - 1$ to the
VC-dimension of the itemsets, the d-index $d_2$ of the dataset (which is an upper bound to the empirical VC-dimension
of the itemsets on the dataset), the two candidates $\varepsilon'_1$ and $\varepsilon'_2$ for the first offset parameter $\varepsilon' = \min\{\varepsilon'_1, \varepsilon'_2\}$, the
non-empirical and empirical b-indexes $b_1 = b(\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)), Z)$ and $b_2 = eb(\mathcal{B}^-(\mathsf{RFI}(p, \mathcal{I}, \theta)), \mathcal{D}, Z)$, and the two
candidates $\varepsilon''_1$ and $\varepsilon''_2$ for the second offset parameter $\varepsilon''$. The first four only depend on the ground set $\mathcal{I}$ and on the
dataset $\mathcal{D}$, while the others also depend on the minimum frequency threshold $\theta$. Table 1 reports the values of these
parameters in an enlarged version of the datasets from the FIMI repositories. Throughout all our experiments we used
$\delta = 0.1$.

Dataset    θ      |I|-1   d_2   ε'_1     ε'_2     b_1   b_2   ε''_1    ε''_2
accidents  0.4    467     46    0.00342  0.01819  12    12    0.00061  0.00958
BMS-POS    0.01   1656    81    0.00644  0.02394  13    13    0.00063  0.00995
kosarak    0.04   41269   442   0.03212  0.05512  16    11    0.00068  0.00920
pumsb*     0.45   2087    59    0.00722  0.02052  12    10    0.00061  0.00880
retail     0.022  16469   58    0.02029  0.02035  15    9     0.00062  0.00838

Table 1: Bounds to (empirical) VC-dimensions and offset parameters

¹ http://fimi.ua.ac.be/src/
² http://fimi.ua.ac.be/data/

The first interesting result is that using the d-index of the dataset or the quantity $|\mathcal{I}| - 1$ as upper
bounds to the empirical b-index and to the b-index would clearly be extremely conservative. In some cases, like for the
kosarak dataset, the b-indexes are orders of magnitude smaller than the d-index, let alone $|\mathcal{I}| - 1$.

Secondly, we can see that in most cases $d_2 \ll |\mathcal{I}| - 1$, and this will be especially true in datasets from market
basket processes like those from large electronic commerce websites, where the number $|\mathcal{I}|$ of items is huge but most
of the transactions are very short because the majority of customers buy a limited number of products. A germane
example of this situation is the retail dataset, which "contains the (anonymized) retail market basket data from an
anonymous Belgian retail store";³ for this dataset, $d_2$ is almost three orders of magnitude smaller than $d_1$. Despite
the fact that even for this dataset we have $\varepsilon'_1 < \varepsilon'_2$, there are indeed cases for which the opposite relation is true. For
example, if the dataset size were 15 million transactions instead of 20 million, we would have, for retail, using the
same values for $d_1$ and $d_2$, $\varepsilon'_1 = 0.02343$ and $\varepsilon'_2 = 0.02330$.

This points out that considering an upper bound to the empirical VC-dimension when computing the first offset
parameter is reasonable and can be useful even if $\varepsilon'_2$ is more dependent on the size of the dataset than $\varepsilon'_1$. In any case,
it is clear that the correction $\varepsilon'$ to the minimum frequency threshold $\theta$ is, in all cases, small and thus practical.

From Table 1 it is possible to appreciate that there can be a gap, although small, between the non-empirical and
the empirical b-indexes. The fact that the difference between these two parameters is small implies that $\varepsilon''_1$ is smaller
than $\varepsilon''_2$ because, as we said before, $\varepsilon''_1$ is less dependent on the size of the dataset. The resulting parameter $\varepsilon''$ is very
small, to the order of $10^{-3}$. This fact intuitively suggests that the statistical test should have good statistical
power. This is indeed the case, as we will see in Section 5.5.



5.4 Control of the FWER 

We evaluated the capability of our method to control the FWER by first creating a number of enlarged datasets with
20 million transactions each, as described in Section 5.2, and then using Algorithm 2 to extract a collection of RFI's
using a range of different minimum real frequency thresholds $\theta$, with FWER bound $\delta = 0.1$. We repeated each experiment
multiple times on different datasets generated from the same distribution. In all the hundreds of runs of our algorithm,
the returned collection of itemsets only contained RFI's and never contained any false positive. This suggests
not only that our method can control the FWER effectively, but also that it is more conservative than the guarantees
obtained from the theoretical analysis, since the returned collection never contained a spurious itemset even though the
upper bound $\delta$ to the probability of such an event was set to 0.1. This is expected, as the (empirical) b-index is not always
a tight bound to the (empirical) VC-dimension.



5.5 Statistical power 



As described in Section 3.3, the power of a statistical test is the probability that the test will reject the null hypothesis
when the null hypothesis is false. In most cases, it is difficult to analytically quantify the statistical power of a test,
especially in the case of multiple hypothesis testing when the hypotheses are correlated. This is indeed the case for
the RFI's problem, and we therefore conducted an empirical evaluation of the statistical power of our method by
assessing what fraction of the total number of RFI's is reported in output, using the procedure described in Section 5.2
to create datasets with 20 million transactions for which we know the actual distribution of RFI's. We fixed $\delta = 0.1$
and repeated each experiment 20 times, finding very small variance in the results.

³ Quote from the dataset description on the FIMI page.

We also wanted to compare our power with that of established methods to control the probability of Type-1 errors. 

A classical and widespread method to control the FWER is the Bonferroni correction [13]. This method is
extremely easy to apply and is also known to be excessively conservative: given a number $k$ of hypotheses $h_i$, $1 \le i \le k$,
to be tested, to ensure that the FWER is at most $\alpha$, one can fix, a priori, $k$ positive weights $\alpha_i$, $1 \le i \le k$, such that
$\sum_i \alpha_i = \alpha$ and test each hypothesis independently using $\alpha_i$ as the critical value, i.e., rejecting the hypothesis if the
p-value resulting from its testing is smaller than $\alpha_i$. The most common way to set the weights is to have $\alpha_i = \alpha/k$. In
our case, we have $k = 2^{|\mathcal{I}|} - 1$, i.e., the number of itemsets that can be built on the ground set $\mathcal{I}$. In most cases,
especially the most interesting ones, $|\mathcal{I}|$ can be in the order of thousands, as can be seen from Table 1 ($d_1 = |\mathcal{I}| - 1$). This
implies that $k$ is extremely high (to the order of $10^{300}$ when $|\mathcal{I}|$ is around 1000) and conversely $1/k$ is extremely low
($10^{-300}$), as would be the weights $\alpha_i = \alpha/k$. We tried to employ the Bonferroni correction and the exact Binomial
test on itemsets $X$ with $f_{\mathcal{D}}(X) \ge \theta$ using $r_p(X) < \theta$ as null hypothesis to extract a collection of FI's with FWER
$\delta$ and compare the statistical power of this method with ours, but we hit the computational precision limits of our
platform (quad-core AMD Phenom™ II X4 955 Processor running GNU/Linux 2.6.32-5-amd64). Specifically, extended
precision (using the GNU C Compiler long double type, with 80 bits of precision) was not enough to represent the
small values taken by the p-values and the weights. We already argued why the weights are very small (mostly
due to $k = 2^{|\mathcal{I}|} - 1$), while the p-values take very small values due to the very high values of $|\mathcal{D}|$, which are typically in
the order of hundreds of thousands or millions. A software implementation of quad precision (the GCC __float128
type, offering 128 bits of precision) made the computation too slow to be practical, especially due to the lack of fast
numerical libraries supporting such high precision. Even the use of an upper bound to the number of potential RFI's
with respect to the minimum real frequency threshold [10, Lemma 4.1] does not help. We therefore conclude that the
direct adjustment of weights using the Bonferroni procedure is not applicable for practical reasons to the case of RFI's.
Similar conclusions can be drawn for the use of the Holm-Bonferroni procedure [25], a less conservative variation of
the one described above, with the additional drawback of having to compute the frequencies of all itemsets in the
datasets, which is computationally extremely expensive. Layered Critical Values [43], a recently proposed technique to
control the FWER in pattern discovery while reaching good statistical power, suffered from the same computational
precision issues as the Bonferroni correction, making it unappealing for RFI's discovery when the search space is huge
(as we said, approx. $10^{300}$) and there is no upper limit to the maximum size of an itemset.
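
The orders of magnitude involved can be reproduced with exact integer arithmetic. The sketch below computes $\log_{10}(\alpha/k)$ for $k = 2^{|\mathcal{I}|} - 1$ without ever forming the weight as a floating-point number; the function name is hypothetical, and the snippet only illustrates the size of the weights, not the tests we attempted.

```python
import math


def log10_bonferroni_weight(alpha, n_items):
    """log10 of alpha / k, with k = 2**n_items - 1 candidate itemsets."""
    k = 2 ** n_items - 1  # exact Python integer, no overflow
    # math.log10 accepts arbitrarily large integers.
    return math.log10(alpha) - math.log10(k)
```

For $\alpha = 0.1$ and 1000 items the result is about $-302$, in line with the figure quoted above. For BMS-POS, where $|\mathcal{I}| - 1 = 1656$ in Table 1, the result is close to $-500$, well below the smallest positive value representable in IEEE double precision.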

We stress again that the method from [23] cannot be applied to our problem, as we mentioned in the review of
the previous work in Section 2.

We compared our algorithm with the holdout method [42], which reduces the search space with the goal of mitigating
the correction necessary for the multiple hypotheses test. In the holdout method, the dataset is split into two portions,
an exploratory dataset and an evaluation dataset. The exploratory dataset is mined and itemsets are extracted from it,
provided they pass a statistical test without any correction, i.e., with critical value $\alpha$. Then, these same patterns are
tested on the evaluation dataset using a critical value $\alpha/k$ corrected for multiple hypotheses testing, where $k$ is the
number of patterns found in the exploratory step. We implemented the holdout method using the exact Binomial test
and the Bonferroni correction and compared the power of this method with that of our statistical test.
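
The two steps of the holdout method can be sketched as follows. This is an illustration under stated assumptions, not the code used in our experiments: the p-value of an itemset is taken to be the upper tail of a Binomial distribution with success probability $\theta$, the tail is computed in log space to limit underflow, a hypothesis is rejected when its p-value is at most the critical value, and the function names and the dictionary-based inputs are hypothetical.

```python
import math


def log_binom_tail(x, n, p):
    """log P[Bin(n, p) >= x] for 0 < p < 1, computed in log space."""
    if x <= 0:
        return 0.0
    if x > n:
        return float("-inf")
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
        for k in range(x, n + 1)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def holdout_test(explore_counts, eval_counts, n_explore, n_eval, theta, alpha):
    """Return the itemsets that pass both steps of the holdout method."""
    # Step 1: uncorrected test on the exploratory portion.
    candidates = [
        a for a, c in explore_counts.items()
        if log_binom_tail(c, n_explore, theta) <= math.log(alpha)
    ]
    if not candidates:
        return []
    # Step 2: Bonferroni-corrected test on the evaluation portion,
    # with k equal to the number of candidates from step 1.
    threshold = math.log(alpha) - math.log(len(candidates))
    return [
        a for a in candidates
        if log_binom_tail(eval_counts.get(a, 0), n_eval, theta) <= threshold
    ]
```

The correction in the second step depends only on the number of candidates that survive the first step, which is why the method is attractive when that number is small.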

The results of the statistical power evaluation and of the comparison with the holdout method are presented in
Table 2. It is possible to appreciate that the power of Algorithm 2 is very high by itself even when a large number
of RFI's is present. In comparison with the holdout method, we can see that there are datasets, distributions, and
frequencies at which the statistical test presented in this work performs better than the holdout, and others at which
the holdout performs better than our method. The two approaches are therefore comparable and neither has a clear
edge over the other. We can nevertheless see a pattern: when the number of RFI's grows, the direct correction applied
within the holdout method becomes less and less powerful because it cannot take into account the many correlations
between the itemsets. Given that in common applications the number of frequent itemsets of interest is large, in real
scenarios our method is more effective in extracting the real frequent itemsets while rigorously controlling the false
discoveries.

We remark that even if Liu et al. [32] found that the holdout method has less power than the direct adjustment or
resampling methods, we already argued that these are not applicable to the RFI's problem or practical in our settings.

                           Statistical power (%)
Dataset    θ      RFI's    Algorithm 2   Holdout
accidents  0.4    32528    99.157        98.831
BMS-POS    0.01   1099     88.080        97.088
kosarak    0.04   42       95.238        97.619
pumsb*     0.45   1913     97.680        97.543
retail     0.022  47       95.744        97.872

Table 2: Statistical power
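
The power figures are fractions of the RFI's that appear in the output. A minimal sketch of that computation, with a hypothetical function name, is given below; as a purely arithmetic observation, the value 95.238 reported for kosarak coincides with 40 reported itemsets out of 42 RFI's, and 97.619 with 41 out of 42.

```python
def statistical_power(reported, real_frequent):
    """Percentage of the real frequent itemsets that were reported."""
    real = set(real_frequent)
    return 100.0 * len(real & set(reported)) / len(real)
```

Itemsets in the output that are not RFI's do not affect this measure; they are accounted for by the FWER instead.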



                  Real frequency estimation error
Dataset    θ      maximum        average
accidents  0.4    4.3 · 10^-4    9.2 · 10^-5
BMS-POS    0.01   2.1 · 10^-4    2.4 · 10^-5
kosarak    0.04   2.1 · 10^-4    4.6 · 10^-5
pumsb*     0.45   3.7 · 10^-4    9.7 · 10^-5
retail     0.022  2.3 · 10^-4    4.0 · 10^-5

Table 3: Real frequency estimation error



It is also important to stress that the statistical test presented in this work can be applied in cases when the holdout
method is not a viable option, for example when the dataset cannot be split into two parts at random. As argued
by Hanhijarvi [23], this may be the case for network and spatial data.

Note that existing methods to control the FDR [4] would also suffer from the same issues as the Bonferroni 
correction, and therefore be too computationally expensive to be practical when only a fraction of all possible itemsets 
are RFI's. 

5.6 Real frequency estimation 

Recalling the definition of an ε-approximation to a range set R (Def. 6), one may ask how close the frequency of an RFI 
in the sample is to its real frequency in the distribution. It is important to note, though, that the error bound expressed 
in the definition of an ε-approximation is only valid for the members of R. In the case of the RFI's, the itemsets 
A such that T(A) ∈ -Rfi-(RFi(p,z,e)) are not RFI's. Therefore, provided the dataset is an ε'-approximation, none of 
them will be included in the output collection of itemsets computed by our method. This implies that our method 
cannot guarantee any bound on the error |f_D(X) − r_p(X)| for the itemsets included in the output. Nevertheless, the 
frequencies in the sample are very close to the real frequencies, orders of magnitude closer than ε. In Table 3 we report 
the maximum and the average error |f_D(X) − r_p(X)| for the same datasets we used in the evaluation of the statistical 
power of our method, with δ = 0.1. 
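
The two quantities reported in Table 3 can be computed as in the following minimal sketch. It assumes the real frequencies are available, as they are for generated datasets. The function name, the dictionary layout, and the toy frequencies are illustrative assumptions, not taken from our implementation.

```python
def estimation_errors(sample_freq, real_freq, output):
    """Return (maximum, average) of |f_D(X) - r_p(X)| over the output itemsets.

    sample_freq and real_freq map each itemset (a frozenset) to its frequency
    in the dataset and to its real frequency, respectively.
    """
    errors = [abs(sample_freq[x] - real_freq[x]) for x in output]
    if not errors:
        raise ValueError("the output collection is empty")
    return max(errors), sum(errors) / len(errors)


# Toy frequencies (made-up values, for illustration only).
a, b = frozenset({"a"}), frozenset({"b"})
sample_freq = {a: 0.4102, b: 0.4051}
real_freq = {a: 0.4100, b: 0.4050}

max_err, avg_err = estimation_errors(sample_freq, real_freq, [a, b])
print(f"maximum: {max_err:.1e}, average: {avg_err:.1e}")
```

Only the itemsets in the output collection enter the computation, which matches the errors reported in the table.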

6 Conclusions 

In this work we consider the problem of mining Real Frequent Itemsets (RFI's), i.e., itemsets whose probability of 
appearing in a transaction sampled from a distribution is at least θ, where θ is a user-specified parameter, given only 
a transactional dataset that is a collection of independent identically distributed samples from the distribution. Since 
every transactional dataset represents only a finite observation from the process for which knowledge is inferred using 
the mining task, this is a fundamental problem in data mining, but it has received scant attention in the literature. 

In this paper we present an algorithm to mine a collection of RFI's while providing guarantees on the quality of the 
returned collection. In particular, our method guarantees that the probability of any spurious discovery (i.e., an itemset 
in the collection whose probability is below θ) is within the user-specified parameter δ. Our method is orthogonal 
to the numerous methods that have been previously proposed in the literature to control false discoveries among the 
frequent itemsets, since such methods focus on the itemsets that are frequent in the dataset, assuming that they reflect 
the RFI's. 

To identify a high-quality collection of real frequent itemsets, we use results from statistical learning theory. We 
define a range set associated with the problem of mining RFI's, and give an upper bound to its (empirical) VC-dimension, 
showing an interesting connection with a variant of the knapsack optimization problem. To the best of our knowledge, 
ours is the first work to apply techniques from statistical learning theory to the field of RFI's, and in general the 
first application of the sample complexity bound based on the empirical VC-dimension to the field of data mining. We 
implemented our test and evaluated its performance on large datasets generated from available transactional datasets. 
We found that it performs very well on multiple metrics. Our method can control the FWER even better than what 
is guaranteed by the theoretical analysis. We evaluated its statistical power and compared it to the power of other 
available statistical tests adapted to the discovery of RFI's, and found it not only very high in absolute terms, but also 
comparable to or even better than the power of state-of-the-art tests. 

There are a number of directions for further research, including how to improve the power of methods to identify 
RFI's, and studying lower bounds to the VC-dimension of the range set we define to mine RFI's. Moreover, while 
this work focuses on itemsets mining, further research is needed for the extraction of more structured (e.g., sequences, 
graphs) real frequent patterns. 